Retrieval-Augmented Generation — RAG — has become the dominant enterprise AI architecture pattern for a simple reason: it solves the most critical limitation of large language models for business use. LLMs have a knowledge cutoff and no access to your proprietary data. RAG bridges that gap, giving models access to your documents, databases, and knowledge bases at inference time. Done well, it enables AI systems that are accurate, current, and auditable. Done poorly, it produces AI that confidently returns wrong answers.
This guide covers what it actually takes to build a production RAG system — not a demo, not a notebook, but a system that handles real enterprise data at scale and returns reliably useful results.
The RAG Architecture Stack
A production RAG system has five core components, each with real engineering decisions:
- Ingestion pipeline: How you get documents into the system. This includes parsing (PDF, Word, HTML, structured data), chunking strategy, and metadata extraction.
- Embedding model: Converts text chunks into vector representations. Your choice here significantly affects retrieval quality.
- Vector store: Stores and indexes the embeddings. Options range from hosted services (Pinecone, Weaviate Cloud) to self-hosted (Qdrant, pgvector).
- Retrieval layer: The query-time logic that finds relevant chunks. Vector similarity search on its own handles exact-match queries poorly, so hybrid search is the standard.
- Generation layer: The LLM that synthesises retrieved context into a response. With GPT-5 and Gemini Ultra 2, this layer is increasingly capable — but only if what you retrieve is high quality.
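To make the division of labour between the five components concrete, here is a toy end-to-end sketch. Everything in it is a hypothetical stand-in chosen so the example runs without dependencies: `embed` uses word counts instead of a real embedding model, `InMemoryStore` is a plain list instead of a vector database, and the generation layer stops at building the prompt an LLM would receive.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str


def ingest(docs: dict[str, str]) -> list[Chunk]:
    """Ingestion: split each document on blank lines (placeholder chunking)."""
    return [
        Chunk(doc_id, para.strip())
        for doc_id, body in docs.items()
        for para in body.split("\n\n")
        if para.strip()
    ]


def embed(text: str) -> Counter:
    """Embedding stand-in: sparse word counts, NOT a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Vector store stand-in: a list that is scanned linearly on every query."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, Chunk]] = []

    def add(self, chunks: list[Chunk]) -> None:
        self.items.extend((embed(c.text), c) for c in chunks)

    def search(self, query: str, top_k: int) -> list[Chunk]:
        """Retrieval: rank every stored chunk by similarity to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[:top_k]]


def build_prompt(question: str, context: list[Chunk]) -> str:
    """Generation input: the prompt an LLM would receive (the LLM call is omitted)."""
    sources = "\n".join(f"[{c.doc_id}] {c.text}" for c in context)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {question}"


# Hypothetical sample data, for illustration only.
docs = {
    "policy": "Refunds are issued within 14 days.\n\nShipping takes 3 days.",
    "faq": "Support is open on weekdays.",
}
store = InMemoryStore()
store.add(ingest(docs))
question = "how long do refunds take"
top = store.search(question, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real system each function would be replaced by the corresponding production component, but the interfaces between them (chunks in, vectors stored, ranked chunks out, prompt assembled) tend to keep roughly this shape.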
Where Most Enterprise RAG Systems Fail
The most common failure point in enterprise RAG is the chunking strategy. Most teams default to fixed-size chunking (splitting documents every 512 or 1024 tokens) because it's simple to implement. But fixed-size chunking frequently splits logical units — a paragraph, a step in a process, a product specification — in ways that destroy the semantic coherence needed for good retrieval.
Better approaches include semantic chunking (splitting at natural linguistic boundaries), hierarchical chunking (creating both summary-level and detail-level chunks for the same content), and document-structure-aware chunking (treating headers, tables, and lists as their own semantic units). The difference in retrieval quality between naive fixed chunking and well-designed semantic chunking is typically 20–35% on standard benchmarks, though the exact gain depends on the corpus and the metric, so verify it against your own evaluation set.
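The contrast between fixed-size and boundary-aware chunking can be shown in a few lines. This is a minimal sketch under stated assumptions: whitespace-separated words stand in for tokens, blank lines stand in for "natural boundaries", and the function names and sample text are hypothetical. A production chunker would use the embedding model's tokenizer and richer structure detection.

```python
import re


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` words, ignoring document structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def paragraph_chunks(text: str, max_words: int) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_words` words.

    A single paragraph longer than `max_words` is kept intact as its own
    oversized chunk rather than being cut in the middle.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical three-step procedure, one step per paragraph.
procedure = (
    "Step one: open the valve slowly.\n\n"
    "Step two: check the pressure gauge reading.\n\n"
    "Step three: close the valve."
)

naive = fixed_size_chunks(procedure, size=8)
aware = paragraph_chunks(procedure, max_words=8)
```

With an 8-word budget, the naive splitter ends its first chunk partway into step two, while the paragraph-aware version keeps each step whole.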
Embedding Model Selection
The embedding model market has matured significantly. For most enterprise use cases, the decision comes down to three options:
- OpenAI text-embedding-3-large: Strong general performance, 3072 dimensions, $0.13/million tokens. Good default for English-heavy enterprise content.
- Cohere Embed v4: Best-in-class for multilingual content and for datasets with mixed text and tabular data. This is a critical consideration for Southeast Asian enterprises with Vietnamese or Filipino language content.
- Local models (BGE-M3, E5-mistral-7b): Self-hosted options that eliminate per-token costs at scale and keep data on-premises. Worth evaluating for high-volume or data-sensitive deployments.
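A quick back-of-the-envelope calculation helps when weighing a hosted model against a self-hosted one. The sketch below uses the price and dimension count quoted above for text-embedding-3-large. The corpus size, chunk length, and float32 storage format are hypothetical assumptions for illustration, and real indexes add overhead on top of the raw vectors.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float = 0.13) -> float:
    """One-off API cost to embed `total_tokens` at a per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million


def raw_vector_bytes(num_vectors: int, dimensions: int = 3072, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 assumed, no index overhead)."""
    return num_vectors * dimensions * bytes_per_value


# Hypothetical corpus: adjust these three numbers to your own data.
num_docs = 50_000
tokens_per_doc = 2_000
tokens_per_chunk = 500

total_tokens = num_docs * tokens_per_doc
num_chunks = total_tokens // tokens_per_chunk

cost = embedding_cost_usd(total_tokens)
size = raw_vector_bytes(num_chunks)
print(f"tokens={total_tokens:,} chunks={num_chunks:,} cost=${cost:.2f} size={size / 1e9:.2f} GB")
```

Under these assumptions the one-off embedding pass costs about $13 and the raw vectors take roughly 2.5 GB. The per-token fee starts to matter when the corpus is re-embedded often or query volume is high, which is where the self-hosted options become worth evaluating.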
Hybrid Search: The Production Standard
Pure vector similarity search produces poor results for many enterprise queries — particularly precise lookups (product codes, contract numbers, names) where exact keyword matching outperforms semantic search. The production standard in 2026 is hybrid search: running both dense vector search and sparse BM25 keyword search in parallel, then combining results using Reciprocal Rank Fusion (RRF).
Qdrant and Weaviate both support hybrid search natively. PostgreSQL with pgvector supports it by combining vector search with the built-in full-text search, although the built-in ranking functions (such as ts_rank) are not BM25. Most enterprise teams see 15–25% retrieval quality improvement from hybrid search compared to vector-only approaches.
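Reciprocal Rank Fusion itself is small enough to write by hand: each document's fused score is the sum of 1 / (k + rank) over every result list it appears in. The sketch below assumes the commonly used constant k = 60 and takes two already-ranked lists of document IDs. The function name and the sample IDs are hypothetical, and in practice the vector store usually performs this step for you.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Ranks are 1-based. Documents missing from a list contribute nothing
    for that list. Ties are broken alphabetically by document ID.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from the two retrievers, best match first.
dense_hits = ["doc_a", "doc_b", "doc_c"]   # vector search
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # BM25 keyword search

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because RRF uses only rank positions, it avoids having to normalise BM25 scores and cosine similarities onto a common scale, which is a large part of its appeal.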
Evaluation: The Piece Nobody Wants to Do
The most common reason enterprise RAG systems drift from "working in testing" to "unreliable in production" is the absence of a systematic evaluation framework. Before going to production, you need:
- A golden dataset of at least 100 representative questions with validated correct answers
- Automated retrieval evaluation metrics: recall@k (is the right document in the top-k retrieved?), precision@k, Mean Reciprocal Rank
- Generation evaluation: faithfulness (does the answer reflect the retrieved context?), answer relevance, hallucination rate
- Continuous monitoring: alerting when retrieval or generation quality drops below threshold in production
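The retrieval metrics in this list are simple to compute directly, which is useful for a first pass over a golden dataset. This is a minimal sketch: the function names and the sample runs are hypothetical, and each run pairs the retriever's ranked document IDs with the set of IDs a reviewer marked as relevant.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over all (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Hypothetical golden-dataset runs: (ranked results, validated relevant IDs).
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # relevant doc at rank 2
    (["d2", "d9", "d4"], {"d2", "d4"}),  # relevant docs at ranks 1 and 3
    (["d5", "d6", "d8"], {"d0"}),        # relevant doc never retrieved
]

mrr = mean_reciprocal_rank(runs)
avg_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in runs) / len(runs)
```

Tracking these numbers per release gives the threshold that production monitoring can alert against.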
RAGAs (Retrieval Augmented Generation Assessment) is now the standard open-source framework for this evaluation layer. It integrates with LangChain and LlamaIndex, making it relatively low-friction to add to an existing pipeline.
For enterprises integrating RAG with Odoo or other ERP systems — building knowledge bases from ERP documentation, product data sheets, or support ticket histories — the evaluation step is non-negotiable. The stakes of a wrong answer in a business context are real, and the only way to manage that risk is systematic measurement.